Let’s play together: Collaborative Data Science

Data Science Conference 4.0

Mario Annau

September 19, 2018

Why is it so hard?

  • Data Science is an interdisciplinary field.
  • Most scientists care more about methods than code.
  • Most engineers care more about code than methods.
  • Psychological barriers exist that keep people from collaborating.

Why is it so important?

  • Review of models and code improves overall quality.
  • Collaboration can generate new ideas.
  • Network effects arise when more people work together efficiently.

Network Effects

Improving Network Effects

  • How can code be managed to have positive network effects?
  • How can teams efficiently communicate and collaborate together?

Case study: The CRAN package repository

CRAN Packages Published

Authors per Package

Package Redundancy

  • Lack of communication between authors can lead to redundant packages.
  • Redundancy is not helpful for infrastructure packages.
  • Example: R-Excel Package
  • Example: HDF5 package development

HDF5 packages

  • Store large amounts of data, e.g. tick data
  • Unsatisfied with rhdf5, hdf5, h5r, … → h5

2 years ago …

  • Presentation of h5 at R/Finance 2016
  • Rcpp to interface HDF5 C++ API
  • Basic HDF5 features implemented

… 2 months later …

On June 21, 2016 Holger wrote:

… my name is Holger Hoefling, I have developed a new version of a wrapper library for hdf5 (R6 Classes, almost all function calls wrapped, full support for all datatypes including tables etc) …

And I replied:

On June 21, 2016 Mario wrote:

sounds interesting!

What’s different in hdf5r?

  • Automatic code generation against HDF5 C API
  • Usage of R6 (instead of S4) classes
  • Close connections during garbage collection
  • Broad coverage of low-level library features
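
The "close connections during garbage collection" point can be illustrated with a small sketch. It is written in Python rather than R, and it is not the hdf5r implementation: the `Handle` class, its `open_ids` registry, and the `demo` function are hypothetical names used only to show the general idea of tying a resource's cleanup to the lifetime of the object that wraps it.

```python
import gc
import weakref


class Handle:
    """Hypothetical stand-in for a wrapped file handle (not the hdf5r API)."""

    open_ids = set()  # low-level ids that are currently open

    def __init__(self, handle_id):
        self.handle_id = handle_id
        Handle.open_ids.add(handle_id)
        # Register a finalizer that runs when this object is collected.
        # The callback must not reference `self`, otherwise the object
        # would be kept alive and never collected.
        self._finalizer = weakref.finalize(
            self, Handle.open_ids.discard, handle_id
        )

    def close(self):
        """Explicit close; calling it more than once is harmless."""
        self._finalizer()


def demo():
    h = Handle(1)
    open_before = 1 in Handle.open_ids
    del h         # drop the last reference to the wrapper
    gc.collect()  # request a collection in case the object is not freed immediately
    open_after = 1 in Handle.open_ids
    return open_before, open_after
```

With this pattern a user who forgets to call `close()` does not leak the underlying handle, while an explicit `close()` remains available for deterministic cleanup.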

Merging codebases

  • Maintain high-level interface and test cases from h5
  • Get low-level HDF5 support within R

Merge Git

On Oct 10, 2016 Holger wrote:

thanks - merged!

The Joys of Collaboration

(after overcoming psychological barriers)

  • Code reviews
  • Higher Quality Code
  • End product of higher quality than separate packages.

Q: How can code be managed to have positive network effects?

  • Put it into a reusable package.
  • Continuous code reviews and tests.
  • Transparent platform to inspect.

Q: How can teams efficiently communicate and collaborate together?

  • Have the right tools and mindset in place.
  • Incentivise collaborative efforts.
  • Accept unexpected hypotheses and failures.
  • Open-mindedness.

Collaboration Torvalds Style

Tools used: E-mail, Git

Merge Git

https://www.youtube.com/watch?v=LE0JtUeyVJA

Thank you!

Check out our homepage at

https://www.quantargo.com

Presentation source available at

https://github.com/Quantargo/data-science-collaboration